Implement 4over6 NVFP4 recipe #2972
Conversation
Greptile Summary: This PR adds NVFP4 4over6 quantization support to TransformerEngine's NVFP4BlockScaling recipe.

Confidence Score: 5/5. The PR is safe to merge. The 4over6 feature is entirely opt-in and isolated behind the new nvfp4_4over6 recipe field; all existing NVFP4 paths are untouched when the flag is unset. The kernel implementation is well guarded, with explicit rejection of stochastic rounding, RHT, and grouped quantization at multiple call-site layers. The two findings are limited to an inconsistency in a property setter that all current internal call sites avoid, and a missing secondary validation inside quantize_4over6 that is already covered by quantize_fwd_helper for all real callers. The new quantize_4over6_nvfp4.cuh kernel and the grouped_tensor_storage.py property setter are the two places worth a careful second read.
Flowchart:

```mermaid
%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[Python recipe: NVFP4BlockScaling nvfp4_4over6 scope] --> B[NVFP4BlockScalingRecipeState: resolve nvfp4_use_4over6]
    B --> C[NVFP4Quantizer: nvfp4_use_4over6 bool, nvfp4_e4m3_max int]
    C --> D{quantize path}
    D -- single tensor --> E[quantize_impl: set quant_config fields, reject RHT + stochastic rounding]
    D -- split quant --> F[split_quantize_nvfp4_impl: reject RHT, set per-config fields]
    D -- grouped --> G[group_quantize_nvfp4_impl: reject 4over6]
    E --> H[quantize_fwd_helper: check tensor/config consistency]
    F --> H
    H -- nvfp4_use_4over6=true --> I[quantize_4over6: E4M3_MAX switch, ErrMode switch]
    H -- nvfp4_use_4over6=false --> J[existing quantize_transpose kernels]
    I --> K[quantize_4over6_kernel: load tile async, compute ScalePair map4+map6, pick lower error, write selected scale+data]
    K --> L[NVFP4Tensor: _nvfp4_use_4over6, _nvfp4_e4m3_max]
    L --> M[dequantize_fp4_kernel: E4M3_MAX template]
    L --> N[nvte_nvfp4_compute_per_tensor_scale: fp8_max from tensor]
```
Functionality has been verified by internal RL experiments.
Need to rebase. |
```cpp
  * its values are populated during quantization.
  */
  kNVTERowScaledNVFP4 = 8,
  kNVTENVFP44Over6 = 9,  /*!< Whether an NVFP4 tensor uses 4over6 scaling */
```
We are specifying this redundantly in NVTETensor and NVTEQuantizationConfig. If this option can be isolated to quantization, then we should not add clutter to the tensor. If the option is needed for downstream consumers (dequantization, GEMM), then it should be treated as part of the tensor data. I'm not especially familiar, but 4over6 seems like it should be specific to quantization.
4over6 changes the decode convention from 1 / (6 * 448) to 1 / (6 * 256). Therefore, for our current representation 4over6 is part of the tensor data contract, not just a quantization option.
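To make the contract point concrete, here is a minimal sketch of the two decode conventions; the helper name and structure are illustrative, not TE's actual API. A consumer decoding an NVFP4 element needs the per-tensor scale, and that scale differs between the two modes, which is why the flag has to travel with the tensor:

```cpp
// Sketch: a stored NVFP4 element decodes as
//   value = fp4_data * block_scale_e4m3 * alpha,
// where alpha is the per-tensor FP32 decode scale computed below.
#include <cstdio>

constexpr float kFp4Max = 6.0f;

// Default NVFP4 uses the E4M3 max (448) in the denominator; 4over6 uses 256.
inline float per_tensor_decode_scale(float tensor_amax, bool use_4over6) {
  const float e4m3_max = use_4over6 ? 256.0f : 448.0f;
  return tensor_amax / (kFp4Max * e4m3_max);  // amax / (6*448) or amax / (6*256)
}

int main() {
  std::printf("default: %g\n", per_tensor_decode_scale(1.0f, false));  // 1/(6*448)
  std::printf("4over6:  %g\n", per_tensor_decode_scale(1.0f, true));   // 1/(6*256)
}
```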
```diff
 using namespace detail;
-constexpr float fp8_max = TypeExtrema<fp8e4m3>::max;  // 448.0f
+constexpr float fp8_max = USE_4OVER6 ? 256.0f : TypeExtrema<fp8e4m3>::max;  // 448.0f
 constexpr float fp4_max = TypeExtrema<fp4e2m1>::max;  // 6.0f
```
How much benefit does changing the FP8 scale have on convergence? If we don't see a clear benefit, then it would be nicer to use the same scale for 4over6 and non-4over6. That way we can keep this logic confined to quantization, and downstream consumers are completely unaffected.
If there is an impact on training quality, we should still consider disentangling the FP8 scaling from 4over6. I don't see why other NVFP4 recipes might not benefit from tweaking the scaling.
From the original paper:
> Finally, we make one modification to the computation of the tensor scale α (Equation 1) when quantizing to NVFP4 with 4/6. When M_FP4 × M_FP8 is used to compute the tensor scale, it ensures that all quantized values will be less than 6 × 448. However, this makes it impossible to select a scale of 4 for the blocks that contain a tensor's largest values, because the block's scale would need to be 448 × 6/4 = 672, which would overflow, since 448 is the maximum value that can be represented by E4M3. As a result, when computing the tensor scale, we replace M_FP8 with 256 in Equation 1, since 256 is the largest E4M3 value that can be multiplied by 6/4 and represented without error in E4M3, as 384.
Also:
> In Section 3.1, we propose calculating the FP32 global tensor scale using 256 as the maximum FP8 E4M3 value rather than the default of 448, as this allows blocks with a tensor's largest value to have the option to have a largest FP4 value of 4. In Figure 6, we find that this provides a marginal benefit over using the standard tensor scale calculation. Even though this adjustment only affects a small number of large values, this performance gain may come from the fact that larger activation values can have an outsize impact on model performance. This adjustment is incorporated into the remaining experiments in this section.
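A worked check of the quoted argument, in the paper's Equation 1 notation: the per-tensor scale is $\alpha = \mathrm{amax}_{\text{tensor}} / (M_{FP4} \cdot M_{FP8})$, and the block containing the tensor's amax needs a stored E4M3 block scale of

$$
s_{\text{block}} \;=\; \frac{\mathrm{amax}_{\text{block}}}{m \, \alpha} \;=\; \frac{6}{m}\, M_{FP8}, \qquad m \in \{4, 6\},
$$

where $m$ is the largest FP4 magnitude the block maps to. For map-to-6 ($m = 6$), $s_{\text{block}} = M_{FP8}$, representable either way. For map-to-4 ($m = 4$), $s_{\text{block}} = 1.5 \, M_{FP8}$: with $M_{FP8} = 448$ this is $672 > 448$, overflowing E4M3, while with $M_{FP8} = 256$ it is $384 = 1.1_2 \times 2^8$, exact in E4M3's 3 mantissa bits.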
Not sure if there are internal or external studies about the convergence, but this is required to make it work. We need the largest E4M3 value that is smaller than 448/1.5 and that, together with its product with 1.5, is represented exactly in E4M3. This helps avoid quantization noise on both the map-to-4 and map-to-6 paths.
We did find the use of 256 to calculate the second level scaling factor helped convergence vs 448, but only slightly.
It's possible that the premise of the paper's argument (preventing saturation when the 4 scaling effectively multiplies the block decode scale by 1.5) is sound, but that a value larger than 256 can achieve this, and that the perfect representation of the block containing the global amax under both scalings is not worth the extra range loss.
Let me make the 256 scaling a separate env var, disabled by default.
448, 320, 288, 256 are all potential candidates for map-to-6:
- 448: effectively disable map-to-4 option above 256, preserve range
- 320, 288: map-to-4 uses 448, no precise 1.5x
- 256: map-to-4 uses 384, precise 1.5x
For now, let me refactor the interface to NVTE_NVFP4_4OVER6_E4M3="448"|"256", defaulting to "448", and dispatch to a numeric template parameter in the C++ code instead of a boolean toggle. People can add support for other values or make it more generic (like directly parsing the env var digits) in the future.
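A minimal sketch of that dispatch, under the proposed (not-yet-merged) interface; the launcher name and elided tensor arguments are placeholders:

```cpp
// Hypothetical mapping from the NVTE_NVFP4_4OVER6_E4M3 env var to a numeric
// template parameter; only "448" and "256" are accepted, per the proposal.
#include <cstdlib>
#include <stdexcept>

template <int E4M3_MAX>
void launch_quantize_4over6(/* tensor args elided */) {
  // ... would launch the quantize kernel with E4M3_MAX baked in at compile time ...
}

inline void quantize_4over6_dispatch(/* tensor args elided */) {
  const char* v = std::getenv("NVTE_NVFP4_4OVER6_E4M3");
  const int e4m3_max = (v != nullptr) ? std::atoi(v) : 448;  // default "448"
  switch (e4m3_max) {
    case 448: launch_quantize_4over6<448>(); break;  // preserve range
    case 256: launch_quantize_4over6<256>(); break;  // exact 1.5x (384) for map-to-4
    default:
      throw std::invalid_argument("NVTE_NVFP4_4OVER6_E4M3 must be \"448\" or \"256\"");
  }
}
```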
NVTE_NVFP4_4OVER6_E4M3_USE_256=weights|activations|all is a cleaner pattern and allows separate configuration.
This test is okay, but it would provide much more confidence if the NVFP4 quantization tests compared against a CPU reference impl.
Extended tests/cpp/operator/test_cast_nvfp4_transpose.cu coverage in 3bb42b1.
```
nvfp4_4over6 : {None, 'weights', 'activations', 'all'}, default = None
              Select tensors that use NVFP4 4over6. In this mode NVFP4
              quantization evaluates per-block map-to-4 and map-to-6 candidates
              and chooses the one with lower MSE. Ties choose map-to-6. The
```
We need both MSE (better for post-training?) and MAE (better for pre-training as per our internal studies) to be supported, with MAE as the default.
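To make the two metrics concrete, here is a CPU-style sketch of the per-block candidate selection with a pluggable error mode. The FP4 grid and helper names are illustrative, and the E4M3 rounding of the block scale itself is omitted for brevity:

```cpp
#include <array>
#include <cmath>
#include <cstdio>

enum class ErrMode { kMSE, kMAE };

// Round x to the nearest FP4 (E2M1) magnitude, keeping the sign.
static float round_to_fp4(float x) {
  static const float grid[] = {0.f, 0.5f, 1.f, 1.5f, 2.f, 3.f, 4.f, 6.f};
  float best = grid[0];
  for (float g : grid)
    if (std::fabs(std::fabs(x) - g) < std::fabs(std::fabs(x) - best)) best = g;
  return std::copysign(best, x);
}

// Quantize a 16-element block with decode scale s and accumulate the error.
static float block_error(const std::array<float, 16>& blk, float s, ErrMode m) {
  float err = 0.f;
  for (float x : blk) {
    const float dq = round_to_fp4(x / s) * s;   // quantize, then decode
    const float e = std::fabs(x - dq);
    err += (m == ErrMode::kMSE) ? e * e : e;    // MSE vs MAE accumulation
  }
  return err;
}

// 4over6: try the map-to-6 and map-to-4 scales, keep the lower-error one.
// A tie keeps map-to-6, matching the docstring above.
static float select_scale(const std::array<float, 16>& blk, float amax, ErrMode m) {
  const float s6 = amax / 6.f, s4 = amax / 4.f;
  return (block_error(blk, s4, m) < block_error(blk, s6, m)) ? s4 : s6;
}

int main() {
  std::array<float, 16> blk{};
  for (int i = 0; i < 16; ++i) blk[i] = 0.1f * (i + 1);  // amax = 1.6
  const float s = select_scale(blk, 1.6f, ErrMode::kMAE);
  std::printf("selected decode scale: %g (map-to-%c)\n", s, s == 1.6f / 4.f ? '4' : '6');
}
```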
What is the e2e step time increase with 4/6 on some typical workload?
Major changes from last time:
Description
Implement 4over6 NVFP4 from:
FlashInfer PR:
Enable per-block map-to-4 versus map-to-6 candidate selection for 1D/2D NVFP4 quantization in the NVFP4BlockScaling recipe. This mode currently requires RHT and stochastic rounding to be disabled. Both the original per-tensor scaling and the row-scaled NVFP4 introduced by #2931 are supported. This PR also fixes a few minor bugs for row-scaled NVFP4 from #2931.
Type of change
Changes
Please list the changes introduced in this PR:
- Adds NVTE_NVFP4_4OVER6=weights|activations|all, with unset preserving existing behavior, and threads the selected scope through recipes, quantizers, tensor metadata, split quantization, single-tensor quantization, and C++ tensor/config APIs.
- Honors NVTE_USE_FAST_MATH, and rejects unsupported combinations such as stochastic rounding, grouped tensors, and RHT.

Checklist: